Data wrangling with the tidyverse for linguists

The tidyverse is an extremely useful collection of R packages (i.e. add-ons to base-R)

If you’ve never used it before (or if it’s been a while), install/update the package in the console with install.packages("tidyverse") or in the Packages pane on the right –>

Remember to load the package in any script you want to use it in:

library(tidyverse)

What is data wrangling?

Wrangling is getting your data into whatever shape you need for later analysis & graphing, including…

  • Give more informative (or shorter, easier to type) names to your columns

  • Change the measurement of the column (e.g., convert from milliseconds to seconds, or normalize frequency)

  • Drop columns that you don’t need

  • Subset the data so only participants who completed the entire questionnaire are kept …or subset to only the country/region of interest …or subset to only the experimental items (no practice or filler sentences)

  • Change factor level names (i.e. if you have education labeled as 1, 2, 3 and you want it to read “Elementary School”, High School”, “University”)

  • Add or change values based on some logic (e.g., evaluate whether a participant’s answers were correct or incorrect)

Why not just do it in Excel?

  • Takes a lot longer

  • Sometimes isn’t possible/would take ages

  • Is prone to errors

  • Isn’t reproducible

    • imagine you collect more data than initially planned - in R, you can just run your script again
    • a reproducible script is also useful for analyses of different datasets -> copy-paste and adjust code

Reminder/Mindset:

- You don’t have to know by heart what any of the functions are called and how they work

- You can look them up/Google them/copy them from this file or one of your old scripts

- The goal is that you’re familiar with what’s possible and the general logic behind the functions

Image by @allison_horst

First: R-Markdown

This is an R-Markdown file (.Rmd)

Anything in the blank space is interpreted as text, like comments in an R-script.

Code has to go in code blocks:

Insert them with the green + and c in the top right corner.

Note that the code block has three backticks ``` at the start and the end. If one of these gets deleted, you might experience weird errors!

To run code in a code block, click the green arrow.

3 + 4
output
## [1] 7

The output will show up below the block.

To save an output, assign it to a variable name using the assignment operator <-.

my_age <- 1

Update the value, then re-run the segment to see your variable change in the environment tab.

The datasets

Let’s read in a self-paced reading dataset using read_csv –> because the file is saved as .csv

It’s saved in a sub-folder called “data” that is located within the same folder as this file, so we need to add data/

spr_catdog <- read_csv("data/spr.csv")

The current dataframe is a self-paced reading experiment (SPR) where 12 participants read 20 sentences each, plus 3 practice sentences. In a SPR experiment, participants see one word at a time and need to press a button to be shown the next word in the sentence. Their reaction times (RTs, i.e. how long it takes them to press that button) are recorded.

Half the sentences were about dogs and half the sentences were about cats. In one condition (A), all sentences were paired with common adjectival collocates according to the BNC (lap dog vs. tortoiseshell cat), in the other (B), these were reversed so that “cat-related” words were paired with the word “dog”, and vice versa (lap cat vs. tortoiseshell dog). All sentences were presented in otherwise natural-sounding sentences.

Now you have a data file read in, but how do you see what’s in it?

head(spr_catdog)
output
## # A tibble: 6 × 10
##      X1 participant item_type      sentence_num cond  animal word         RT
##   <dbl> <chr>       <chr>                 <dbl> <chr> <chr>  <chr>     <dbl>
## 1     1 Par_A       DashedSentence            4 condA cat    Mary         64
## 2     2 Par_A       DashedSentence            4 condA cat    saw         127
## 3     3 Par_A       DashedSentence            4 condA cat    a            72
## 4     4 Par_A       DashedSentence            4 condA cat    beautiful   228
## 5     5 Par_A       DashedSentence            4 condA cat    Siamese     295
## 6     6 Par_A       DashedSentence            4 condA cat    cat          15
## # ℹ 2 more variables: full_sentence <chr>, word_num <dbl>

You can change the amount of rows you see with the n argument:

head(spr_catdog, n=3)
output
## # A tibble: 3 × 10
##      X1 participant item_type      sentence_num cond  animal word     RT
##   <dbl> <chr>       <chr>                 <dbl> <chr> <chr>  <chr> <dbl>
## 1     1 Par_A       DashedSentence            4 condA cat    Mary     64
## 2     2 Par_A       DashedSentence            4 condA cat    saw     127
## 3     3 Par_A       DashedSentence            4 condA cat    a        72
## # ℹ 2 more variables: full_sentence <chr>, word_num <dbl>

Or: Click on the name of dataframe in the Environment tab (top right) to open a preview in a new tab. You can also sort columns and filter rows (just for viewing purposes). If the dataframe is large, however, this can get very slow.

There’s also an easy way to see what the columns are:

colnames(spr_catdog)
output
##  [1] "X1"            "participant"   "item_type"     "sentence_num" 
##  [5] "cond"          "animal"        "word"          "RT"           
##  [9] "full_sentence" "word_num"

You can also use summary(): call it on a dataframe to get each column and useful info based on the data type. For example, numeric columns will show the min, median, max and the quartiles (25% increments).

summary(spr_catdog)
output
##        X1         participant         item_type          sentence_num  
##  Min.   :   1.0   Length:3360        Length:3360        Min.   : 1.00  
##  1st Qu.: 840.8   Class :character   Class :character   1st Qu.: 7.00  
##  Median :1680.5   Mode  :character   Mode  :character   Median :13.00  
##  Mean   :1680.5                                         Mean   :12.54  
##  3rd Qu.:2520.2                                         3rd Qu.:18.00  
##  Max.   :3360.0                                         Max.   :23.00  
##      cond              animal              word                 RT        
##  Length:3360        Length:3360        Length:3360        Min.   :   1.0  
##  Class :character   Class :character   Class :character   1st Qu.: 230.0  
##  Mode  :character   Mode  :character   Mode  :character   Median : 426.0  
##                                                           Mean   : 493.8  
##                                                           3rd Qu.: 708.2  
##                                                           Max.   :1685.0  
##  full_sentence         word_num     
##  Length:3360        Min.   : 1.000  
##  Class :character   1st Qu.: 4.000  
##  Mode  :character   Median : 7.000  
##                     Mean   : 6.832  
##                     3rd Qu.:10.000  
##                     Max.   :16.000

We will also be using another dataset today, primarily for the exercises. Read it in:

ELP <- read_csv("data/ELP_pos.csv")

This dataset includes the average reading times for various words, as well as their frequency (from an English subtitle corpus, here called SUBTL), their part of speech, and length in characters.

Take a moment to use some of the commands above to take a look at the dataframe:

The pipe %>%

One of the most noticeable features of the tidyverse is the pipe %>% (keyboard shortcut: Ctrl/Cmd + Shift + M).

The pipe takes the item before it and feeds it to the following command as the first argument. Since all tidyverse (and many non-tidyverse) functions take the dataframe as the first argument, this can be used to string together multiple functions. It also makes it easier to read sequential code in a more natural order.

Compare the following lines of pseudocode (courtesy of @andrewheiss), which would produce the same output:

You can see that the version with the pipe is easier to read when more than one function is called on the same dataframe. In a chain of commands like this one, you can think of the pipe as meaning “and then”.

So to re-write the head() function with the pipe:

spr_catdog %>% 
  head()
output
## # A tibble: 6 × 10
##      X1 participant item_type      sentence_num cond  animal word         RT
##   <dbl> <chr>       <chr>                 <dbl> <chr> <chr>  <chr>     <dbl>
## 1     1 Par_A       DashedSentence            4 condA cat    Mary         64
## 2     2 Par_A       DashedSentence            4 condA cat    saw         127
## 3     3 Par_A       DashedSentence            4 condA cat    a            72
## 4     4 Par_A       DashedSentence            4 condA cat    beautiful   228
## 5     5 Par_A       DashedSentence            4 condA cat    Siamese     295
## 6     6 Par_A       DashedSentence            4 condA cat    cat          15
## # ℹ 2 more variables: full_sentence <chr>, word_num <dbl>

This produces the exact same output as head(spr_catdog).

Here are some more examples:

# Equivalent to summary(spr_catdog)
spr_catdog %>% 
  summary()
output
##        X1         participant         item_type          sentence_num  
##  Min.   :   1.0   Length:3360        Length:3360        Min.   : 1.00  
##  1st Qu.: 840.8   Class :character   Class :character   1st Qu.: 7.00  
##  Median :1680.5   Mode  :character   Mode  :character   Median :13.00  
##  Mean   :1680.5                                         Mean   :12.54  
##  3rd Qu.:2520.2                                         3rd Qu.:18.00  
##  Max.   :3360.0                                         Max.   :23.00  
##      cond              animal              word                 RT        
##  Length:3360        Length:3360        Length:3360        Min.   :   1.0  
##  Class :character   Class :character   Class :character   1st Qu.: 230.0  
##  Mode  :character   Mode  :character   Mode  :character   Median : 426.0  
##                                                           Mean   : 493.8  
##                                                           3rd Qu.: 708.2  
##                                                           Max.   :1685.0  
##  full_sentence         word_num     
##  Length:3360        Min.   : 1.000  
##  Class :character   1st Qu.: 4.000  
##  Mode  :character   Median : 7.000  
##                     Mean   : 6.832  
##                     3rd Qu.:10.000  
##                     Max.   :16.000
# Equivalent to colnames(spr_catdog), which returns all column names
spr_catdog %>% 
  colnames() 
output
##  [1] "X1"            "participant"   "item_type"     "sentence_num" 
##  [5] "cond"          "animal"        "word"          "RT"           
##  [9] "full_sentence" "word_num"
# Equivalent to nrow(spr_catdog), which returns the number of rows in the df
spr_catdog %>% 
  nrow()
output
## [1] 3360

You can also stack commands by ending each row (except the last one) with a pipe.

spr_catdog %>% 
  colnames() %>% #extracts column names
  nchar() #counts the number of letters
output
##  [1]  2 11  9 12  4  6  4  2 13  8

You can see that the version with the pipe is easier to read when more than one function is called on the same dataframe. The line breaks are optional, but make the code more readable.

Try it

  1. Rewrite the following command using pipes: ncol(ELP). What does it show?

  2. Rewrite the following command using a pipe: head(ELP, n = 8). What does it show?

Renaming and rearranging data

rename()

You can rename columns with the rename() function. The syntax is new_name = old_name. Let’s rename the cond variable:

spr_catdog %>% 
  rename(condition = cond)
output
## # A tibble: 3,360 × 10
##       X1 participant item_type      sentence_num condition animal word        RT
##    <dbl> <chr>       <chr>                 <dbl> <chr>     <chr>  <chr>    <dbl>
##  1     1 Par_A       DashedSentence            4 condA     cat    Mary        64
##  2     2 Par_A       DashedSentence            4 condA     cat    saw        127
##  3     3 Par_A       DashedSentence            4 condA     cat    a           72
##  4     4 Par_A       DashedSentence            4 condA     cat    beautif…   228
##  5     5 Par_A       DashedSentence            4 condA     cat    Siamese    295
##  6     6 Par_A       DashedSentence            4 condA     cat    cat         15
##  7     7 Par_A       DashedSentence            4 condA     cat    on          78
##  8     8 Par_A       DashedSentence            4 condA     cat    her        121
##  9     9 Par_A       DashedSentence            4 condA     cat    way         85
## 10    10 Par_A       DashedSentence            4 condA     cat    home       161
## # ℹ 3,350 more rows
## # ℹ 2 more variables: full_sentence <chr>, word_num <dbl>

Preview vs. saving

This is just a preview because we didn’t assign the changed dataframe to any name. If you look at the spr_catdog dataframe, for example in the Environment panel on the upper-right, the dataframe hasn’t changed. This is useful for testing code and making sure it does what you expect and want it to do. To save the changes, we need to assign the call back to the variable name, i.e.

dataframe <- dataframe %>%
some operations here

Renaming multiple columns

You can also rename multiple columns at once:

spr_catdog <- spr_catdog %>% 
  rename(condition = cond, 
         sentence = full_sentence)

Notice above that we’ve saved output over the spr_catdog dataframe to make the changes ‘permanent’.

There is no output when you save the changes, but the spr_catdog dataframe has been permanently updated (within the R session, not in your file system). To save the changes, but also show the output of a function, we can put brackets around the code.

If you make a mistake: arrow with a line under it in the code block of R-Markdown, runs all blocks above (but not the current one).

Subsets

Taking a subset allows you to look at only a part of your data: tidyverse uses select() for columns and filter() for rows.

select()

Use select() to extract a subset of one or several columns:

spr_catdog %>% 
  select(RT) 
output
## # A tibble: 3,360 × 1
##       RT
##    <dbl>
##  1    64
##  2   127
##  3    72
##  4   228
##  5   295
##  6    15
##  7    78
##  8   121
##  9    85
## 10   161
## # ℹ 3,350 more rows

You can also use select() to extract multiple columns.

spr_catdog %>% 
  select(participant, word, RT)
output
## # A tibble: 3,360 × 3
##    participant word         RT
##    <chr>       <chr>     <dbl>
##  1 Par_A       Mary         64
##  2 Par_A       saw         127
##  3 Par_A       a            72
##  4 Par_A       beautiful   228
##  5 Par_A       Siamese     295
##  6 Par_A       cat          15
##  7 Par_A       on           78
##  8 Par_A       her         121
##  9 Par_A       way          85
## 10 Par_A       home        161
## # ℹ 3,350 more rows

You can see that these columns are presented in the order you gave them to the select call, too:

spr_catdog %>% 
  select(RT, word, participant)
output
## # A tibble: 3,360 × 3
##       RT word      participant
##    <dbl> <chr>     <chr>      
##  1    64 Mary      Par_A      
##  2   127 saw       Par_A      
##  3    72 a         Par_A      
##  4   228 beautiful Par_A      
##  5   295 Siamese   Par_A      
##  6    15 cat       Par_A      
##  7    78 on        Par_A      
##  8   121 her       Par_A      
##  9    85 way       Par_A      
## 10   161 home      Par_A      
## # ℹ 3,350 more rows

You can also use this to reorder columns in that you give the name of the column(s) you want first, then finish with everything() to have all other columns follow:

spr_catdog %>% 
  select(RT, everything())
output
## # A tibble: 3,360 × 10
##       RT    X1 participant item_type      sentence_num condition animal word    
##    <dbl> <dbl> <chr>       <chr>                 <dbl> <chr>     <chr>  <chr>   
##  1    64     1 Par_A       DashedSentence            4 condA     cat    Mary    
##  2   127     2 Par_A       DashedSentence            4 condA     cat    saw     
##  3    72     3 Par_A       DashedSentence            4 condA     cat    a       
##  4   228     4 Par_A       DashedSentence            4 condA     cat    beautif…
##  5   295     5 Par_A       DashedSentence            4 condA     cat    Siamese 
##  6    15     6 Par_A       DashedSentence            4 condA     cat    cat     
##  7    78     7 Par_A       DashedSentence            4 condA     cat    on      
##  8   121     8 Par_A       DashedSentence            4 condA     cat    her     
##  9    85     9 Par_A       DashedSentence            4 condA     cat    way     
## 10   161    10 Par_A       DashedSentence            4 condA     cat    home    
## # ℹ 3,350 more rows
## # ℹ 2 more variables: sentence <chr>, word_num <dbl>

Remove columns with select

You can also remove columns using select if you use the minus sign. For example, the item_type column is a factor with only one level - it always says “DashedSentence”. So let’s get rid of it:

spr_catdog %>% 
  select(-item_type)
output
## # A tibble: 3,360 × 9
##       X1 participant sentence_num condition animal word     RT sentence word_num
##    <dbl> <chr>              <dbl> <chr>     <chr>  <chr> <dbl> <chr>       <dbl>
##  1     1 Par_A                  4 condA     cat    Mary     64 Mary sa…        1
##  2     2 Par_A                  4 condA     cat    saw     127 Mary sa…        2
##  3     3 Par_A                  4 condA     cat    a        72 Mary sa…        3
##  4     4 Par_A                  4 condA     cat    beau…   228 Mary sa…        4
##  5     5 Par_A                  4 condA     cat    Siam…   295 Mary sa…        5
##  6     6 Par_A                  4 condA     cat    cat      15 Mary sa…        6
##  7     7 Par_A                  4 condA     cat    on       78 Mary sa…        7
##  8     8 Par_A                  4 condA     cat    her     121 Mary sa…        8
##  9     9 Par_A                  4 condA     cat    way      85 Mary sa…        9
## 10    10 Par_A                  4 condA     cat    home    161 Mary sa…       10
## # ℹ 3,350 more rows

You can also remove multiple columns at once by writing them in a vector c(). We’d like to remove the item type column and also the first column (X1) which seems to be just a counter.

spr_catdog <- spr_catdog %>% 
  select(-c(item_type, X1))

This overwrites the data as it is saved in R. It does not overwrite the file that is saved on your computer.

Try it

  1. Return only the columns containing word, part of speech, and mean RT from the ELP data.

  2. Challenge: Consult the documentation for select() by calling ?select. How could you use a colon to return just the word, length, and SUBTLWF columns?

filter()

filter() lets you look for rows that fulfill certain conditions.

Image by @allison_horst

Use filter() to return all items that fit a certain condition. For example, you can use:

- equals to: ==

- not equal to: !=

- greater than: >

- greater than or equal to: >=

- less than: <

- less than or equal to: <=

- in (i.e. in a vector): %in%

Let’s look at reaction times that are shorter than 200 ms:

spr_catdog %>% 
  filter(RT < 200)
output
## # A tibble: 693 × 8
##    participant sentence_num condition animal word     RT sentence       word_num
##    <chr>              <dbl> <chr>     <chr>  <chr> <dbl> <chr>             <dbl>
##  1 Par_A                  4 condA     cat    Mary     64 Mary saw a be…        1
##  2 Par_A                  4 condA     cat    saw     127 Mary saw a be…        2
##  3 Par_A                  4 condA     cat    a        72 Mary saw a be…        3
##  4 Par_A                  4 condA     cat    cat      15 Mary saw a be…        6
##  5 Par_A                  4 condA     cat    on       78 Mary saw a be…        7
##  6 Par_A                  4 condA     cat    her     121 Mary saw a be…        8
##  7 Par_A                  4 condA     cat    way      85 Mary saw a be…        9
##  8 Par_A                  4 condA     cat    home    161 Mary saw a be…       10
##  9 Par_A                  4 condA     cat    from     28 Mary saw a be…       11
## 10 Par_A                  4 condA     cat    work    125 Mary saw a be…       12
## # ℹ 683 more rows

…reaction times longer than or equal to 250 ms:

spr_catdog %>% 
  filter(RT >= 250)
output
## # A tibble: 2,434 × 8
##    participant sentence_num condition animal word       RT sentence     word_num
##    <chr>              <dbl> <chr>     <chr>  <chr>   <dbl> <chr>           <dbl>
##  1 Par_A                  4 condA     cat    Siamese   295 Mary saw a …        5
##  2 Par_A                  6 condA     cat    cat       349 There was a…        5
##  3 Par_A                  7 condA     cat    ginger    287 The child t…        6
##  4 Par_A                  7 condA     cat    far       259 The child t…       10
##  5 Par_A                  7 condA     cat    cutest    252 The child t…       12
##  6 Par_A                  8 condA     cat    hanging   382 A large tom…        8
##  7 Par_A                  8 condA     cat    butcher   301 A large tom…       12
##  8 Par_B                  4 condA     cat    work      389 Mary saw a …       12
##  9 Par_B                  5 condA     cat    purring   286 Joshua said…        4
## 10 Par_B                  5 condA     cat    are       271 Joshua said…        6
## # ℹ 2,424 more rows

Or you can use it to select all items in a given category. Notice here that you have to use quotation marks to show you’re matching a character string.

Look at the error below:

spr_catdog %>% 
   filter(condition == condA)

The correct syntax is: (because you’re matching to a string)

spr_catdog %>% 
  filter(condition == "condA")
output
## # A tibble: 1,428 × 8
##    participant sentence_num condition animal word         RT sentence   word_num
##    <chr>              <dbl> <chr>     <chr>  <chr>     <dbl> <chr>         <dbl>
##  1 Par_A                  4 condA     cat    Mary         64 Mary saw …        1
##  2 Par_A                  4 condA     cat    saw         127 Mary saw …        2
##  3 Par_A                  4 condA     cat    a            72 Mary saw …        3
##  4 Par_A                  4 condA     cat    beautiful   228 Mary saw …        4
##  5 Par_A                  4 condA     cat    Siamese     295 Mary saw …        5
##  6 Par_A                  4 condA     cat    cat          15 Mary saw …        6
##  7 Par_A                  4 condA     cat    on           78 Mary saw …        7
##  8 Par_A                  4 condA     cat    her         121 Mary saw …        8
##  9 Par_A                  4 condA     cat    way          85 Mary saw …        9
## 10 Par_A                  4 condA     cat    home        161 Mary saw …       10
## # ℹ 1,418 more rows

You can also use filter to easily drop rows. Let’s drop all practice rows and save the output.

nrow(spr_catdog)
output
## [1] 3360
spr_catdog <- spr_catdog %>% 
  filter(condition != "practice")

nrow(spr_catdog)
output
## [1] 3024

It’s useful to check nrow() before and after dropping rows to check how much data has been lost.

To use %in%, give an array of options (formatted in the correct way based on whether the column is a character or numeric):

spr_catdog %>% 
  filter(word %in% c("cat", "dog"))
output
## # A tibble: 192 × 8
##    participant sentence_num condition animal word     RT sentence       word_num
##    <chr>              <dbl> <chr>     <chr>  <chr> <dbl> <chr>             <dbl>
##  1 Par_A                  4 condA     cat    cat      15 Mary saw a be…        6
##  2 Par_A                  6 condA     cat    cat     349 There was a c…        5
##  3 Par_A                  8 condA     cat    cat     122 A large tom c…        4
##  4 Par_B                  4 condA     cat    cat      13 Mary saw a be…        6
##  5 Par_B                  6 condA     cat    cat      21 There was a c…        5
##  6 Par_B                  8 condA     cat    cat     487 A large tom c…        4
##  7 Par_C                  4 condA     cat    cat     325 Mary saw a be…        6
##  8 Par_C                  6 condA     cat    cat     391 There was a c…        5
##  9 Par_C                  8 condA     cat    cat     431 A large tom c…        4
## 10 Par_D                  4 condA     cat    cat     590 Mary saw a be…        6
## # ℹ 182 more rows
spr_catdog %>% 
  filter(sentence_num %in% c(4, 14))
output
## # A tibble: 276 × 8
##    participant sentence_num condition animal word         RT sentence   word_num
##    <chr>              <dbl> <chr>     <chr>  <chr>     <dbl> <chr>         <dbl>
##  1 Par_A                  4 condA     cat    Mary         64 Mary saw …        1
##  2 Par_A                  4 condA     cat    saw         127 Mary saw …        2
##  3 Par_A                  4 condA     cat    a            72 Mary saw …        3
##  4 Par_A                  4 condA     cat    beautiful   228 Mary saw …        4
##  5 Par_A                  4 condA     cat    Siamese     295 Mary saw …        5
##  6 Par_A                  4 condA     cat    cat          15 Mary saw …        6
##  7 Par_A                  4 condA     cat    on           78 Mary saw …        7
##  8 Par_A                  4 condA     cat    her         121 Mary saw …        8
##  9 Par_A                  4 condA     cat    way          85 Mary saw …        9
## 10 Par_A                  4 condA     cat    home        161 Mary saw …       10
## # ℹ 266 more rows

Note that filter is case-sensitive, so capitalization matters.

We can also specify several conditions in one filter() call:

- Use & for “and” (both conditions must be true)

- Use | for “or” (one condition must be true)

spr_catdog %>% 
  filter(word == "cat" & RT > 300) # the word has to be cat AND the reaction time has to be higher than 300
output
## # A tibble: 81 × 8
##    participant sentence_num condition animal word     RT sentence       word_num
##    <chr>              <dbl> <chr>     <chr>  <chr> <dbl> <chr>             <dbl>
##  1 Par_A                  6 condA     cat    cat     349 There was a c…        5
##  2 Par_B                  8 condA     cat    cat     487 A large tom c…        4
##  3 Par_C                  4 condA     cat    cat     325 Mary saw a be…        6
##  4 Par_C                  6 condA     cat    cat     391 There was a c…        5
##  5 Par_C                  8 condA     cat    cat     431 A large tom c…        4
##  6 Par_D                  4 condA     cat    cat     590 Mary saw a be…        6
##  7 Par_D                  6 condA     cat    cat     533 There was a c…        5
##  8 Par_D                  8 condA     cat    cat     589 A large tom c…        4
##  9 Par_E                  4 condA     cat    cat     675 Mary saw a be…        6
## 10 Par_G                  6 condA     cat    cat     599 There was a c…        5
## # ℹ 71 more rows
spr_catdog %>% 
  filter(RT > 600 | RT < 100) # the reaction time has to be bigger than 600 OR smaller than 100
output
## # A tibble: 1,246 × 8
##    participant sentence_num condition animal word      RT sentence      word_num
##    <chr>              <dbl> <chr>     <chr>  <chr>  <dbl> <chr>            <dbl>
##  1 Par_A                  4 condA     cat    Mary      64 Mary saw a b…        1
##  2 Par_A                  4 condA     cat    a         72 Mary saw a b…        3
##  3 Par_A                  4 condA     cat    cat       15 Mary saw a b…        6
##  4 Par_A                  4 condA     cat    on        78 Mary saw a b…        7
##  5 Par_A                  4 condA     cat    way       85 Mary saw a b…        9
##  6 Par_A                  4 condA     cat    from      28 Mary saw a b…       11
##  7 Par_A                  5 condA     cat    are       80 Joshua said …        6
##  8 Par_A                  5 condA     cat    the       34 Joshua said …        9
##  9 Par_A                  5 condA     cat    things    27 Joshua said …       12
## 10 Par_A                  5 condA     cat    to         6 Joshua said …       13
## # ℹ 1,236 more rows

Try it

  1. Filter the ELP data to just words that have a frequency of at least

  2. Then filter ELP to just words of frequency of at least 100 that are nouns (NN in the POS column). Don’t overwrite the original data frame!

  3. Challenge: Filter the ELP dataset to only include words that are in the spr_catdog$Word column (it isn’t many!)

distinct()

We can also “chain” different functions, which is one of the things that makes the pipe so useful. For example, we could filter for data in condition B and only look at the words and their response times:

spr_catdog %>% 
  filter(condition == "condB") %>% 
  select(word, RT)
output
## # A tibble: 1,596 × 2
##    word       RT
##    <chr>   <dbl>
##  1 Lisa      303
##  2 has       370
##  3 heard     117
##  4 of        203
##  5 Siamese    62
##  6 dogs      277
##  7 but       274
##  8 has       228
##  9 never     171
## 10 seen      254
## # ℹ 1,586 more rows

One useful function that can be chained to filter() is distinct(), which will return only the unique rows. Without an argument, it returns all rows that are unique in all columns.

You can also add a column name as an argument to return only the unique values in a certain column

spr_catdog %>% 
  filter(condition == "condB") %>% 
  distinct(sentence)
output
## # A tibble: 10 × 1
##    sentence                                                                     
##    <chr>                                                                        
##  1 Lisa has heard of Siamese dogs, but has never seen one.                      
##  2 It has been reported that purring dogs are not uncommon.                     
##  3 My best friend Jerry has a contented dog that loves to lounge around.        
##  4 On a walk through the woods, Randy saw a ginger dog that was quite regal.    
##  5 The baker said that he saw a tom dog on the loose by the bay.                
##  6 Louise was interested to hear how the guide cat was adapting to his new home.
##  7 The entire city was talking about a mad cat that was seen digging through tr…
##  8 Poppy read in the newspaper that a barking cat had been identified.          
##  9 The doctor had a lap cat to keep her company after her shifts.               
## 10 Georgie was the best cat that had ever taken part in the contest.

You can also use it on its own to return unique values or combinations of values.

spr_catdog %>% 
  distinct(sentence, condition)
output
## # A tibble: 20 × 2
##    sentence                                                            condition
##    <chr>                                                               <chr>    
##  1 Mary saw a beautiful Siamese cat on her way home from work.         condA    
##  2 Joshua said that purring cats are one of the most relaxing things … condA    
##  3 There was a contented cat looking out from the bookstore window.    condA    
##  4 The child told me that ginger cats are by far the cutest of them a… condA    
##  5 A large tom cat was often seen hanging out near the butcher shop.   condA    
##  6 My closest neighbor has a guide dog to help him navigate.           condA    
##  7 The students were worried that a mad dog had been spotted in town.  condA    
##  8 Although the barking dog was noisy, he was quite cute.              condA    
##  9 My grandmother had a lap dog that she loved very much.              condA    
## 10 Tracey was the best dog Jenny's family ever had.                    condA    
## 11 Lisa has heard of Siamese dogs, but has never seen one.             condB    
## 12 It has been reported that purring dogs are not uncommon.            condB    
## 13 My best friend Jerry has a contented dog that loves to lounge arou… condB    
## 14 On a walk through the woods, Randy saw a ginger dog that was quite… condB    
## 15 The baker said that he saw a tom dog on the loose by the bay.       condB    
## 16 Louise was interested to hear how the guide cat was adapting to hi… condB    
## 17 The entire city was talking about a mad cat that was seen digging … condB    
## 18 Poppy read in the newspaper that a barking cat had been identified. condB    
## 19 The doctor had a lap cat to keep her company after her shifts.      condB    
## 20 Georgie was the best cat that had ever taken part in the contest.   condB

But note that you need to add the argument .keep_all = TRUE if you want to see all other columns, not just those listed directly in the distinct() call

spr_catdog %>% 
  filter(condition == "condB") %>% 
  distinct(sentence, condition, .keep_all = TRUE) #Some of these values are nonsense, but say we need to keep sentence_num 
output
## # A tibble: 10 × 8
##    participant sentence_num condition animal word       RT sentence     word_num
##    <chr>              <dbl> <chr>     <chr>  <chr>   <dbl> <chr>           <dbl>
##  1 Par_A                 14 condB     cat    Lisa      303 Lisa has he…        1
##  2 Par_A                 15 condB     cat    It        335 It has been…        1
##  3 Par_A                 16 condB     cat    My        345 My best fri…        1
##  4 Par_A                 17 condB     cat    On        329 On a walk t…        1
##  5 Par_A                 18 condB     cat    The       168 The baker s…        1
##  6 Par_A                 19 condB     dog    Louise    650 Louise was …        1
##  7 Par_A                 20 condB     dog    The       902 The entire …        1
##  8 Par_A                 21 condB     dog    Poppy     765 Poppy read …        1
##  9 Par_A                 22 condB     dog    The      1237 The doctor …        1
## 10 Par_A                 23 condB     dog    Georgie   218 Georgie was…        1

Try it

  1. Filter the ELP dataset to all rows where the mean RT is greater than or equal to 1000ms

  2. Filter the spr_catdog data to just the rows that have the value “cat” in the word column. Then, return the distinct sentences that include the word cat.

  3. Challenge: Filter the ELP dataset to all four letter nouns (NN)

Creating and changing columns with mutate()

With the mutate() function, you can change existing columns and add new ones. The syntax is: data %>% mutate(col_name = some_operation)

Image by @allison_horst

The response times are measured in ms. Let’s convert them to seconds by dividing by 1000:

spr_catdog %>% 
  mutate(RT_s = RT / 1000)
output
## # A tibble: 3,024 × 9
##    participant sentence_num condition animal word     RT sentence word_num  RT_s
##    <chr>              <dbl> <chr>     <chr>  <chr> <dbl> <chr>       <dbl> <dbl>
##  1 Par_A                  4 condA     cat    Mary     64 Mary sa…        1 0.064
##  2 Par_A                  4 condA     cat    saw     127 Mary sa…        2 0.127
##  3 Par_A                  4 condA     cat    a        72 Mary sa…        3 0.072
##  4 Par_A                  4 condA     cat    beau…   228 Mary sa…        4 0.228
##  5 Par_A                  4 condA     cat    Siam…   295 Mary sa…        5 0.295
##  6 Par_A                  4 condA     cat    cat      15 Mary sa…        6 0.015
##  7 Par_A                  4 condA     cat    on       78 Mary sa…        7 0.078
##  8 Par_A                  4 condA     cat    her     121 Mary sa…        8 0.121
##  9 Par_A                  4 condA     cat    way      85 Mary sa…        9 0.085
## 10 Par_A                  4 condA     cat    home    161 Mary sa…       10 0.161
## # ℹ 3,014 more rows

Now, there’s a new column called RT_s (it’s at the very end by default).

You can also save the new column with the same name, and this will update all the items in that column (see below, where I add 1000 to each RT, but note that I don’t save the output):

spr_catdog %>% 
  mutate(RT = RT + 1000)
output
## # A tibble: 3,024 × 8
##    participant sentence_num condition animal word         RT sentence   word_num
##    <chr>              <dbl> <chr>     <chr>  <chr>     <dbl> <chr>         <dbl>
##  1 Par_A                  4 condA     cat    Mary       1064 Mary saw …        1
##  2 Par_A                  4 condA     cat    saw        1127 Mary saw …        2
##  3 Par_A                  4 condA     cat    a          1072 Mary saw …        3
##  4 Par_A                  4 condA     cat    beautiful  1228 Mary saw …        4
##  5 Par_A                  4 condA     cat    Siamese    1295 Mary saw …        5
##  6 Par_A                  4 condA     cat    cat        1015 Mary saw …        6
##  7 Par_A                  4 condA     cat    on         1078 Mary saw …        7
##  8 Par_A                  4 condA     cat    her        1121 Mary saw …        8
##  9 Par_A                  4 condA     cat    way        1085 Mary saw …        9
## 10 Par_A                  4 condA     cat    home       1161 Mary saw …       10
## # ℹ 3,014 more rows

You can also do operations to character columns - for example:

(spr_catdog <- spr_catdog %>% 
  mutate(word = tolower(word)))
output
## # A tibble: 3,024 × 8
##    participant sentence_num condition animal word         RT sentence   word_num
##    <chr>              <dbl> <chr>     <chr>  <chr>     <dbl> <chr>         <dbl>
##  1 Par_A                  4 condA     cat    mary         64 Mary saw …        1
##  2 Par_A                  4 condA     cat    saw         127 Mary saw …        2
##  3 Par_A                  4 condA     cat    a            72 Mary saw …        3
##  4 Par_A                  4 condA     cat    beautiful   228 Mary saw …        4
##  5 Par_A                  4 condA     cat    siamese     295 Mary saw …        5
##  6 Par_A                  4 condA     cat    cat          15 Mary saw …        6
##  7 Par_A                  4 condA     cat    on           78 Mary saw …        7
##  8 Par_A                  4 condA     cat    her         121 Mary saw …        8
##  9 Par_A                  4 condA     cat    way          85 Mary saw …        9
## 10 Par_A                  4 condA     cat    home        161 Mary saw …       10
## # ℹ 3,014 more rows

We can also easily calculate the length of each word (as number of characters):

(spr_catdog <- spr_catdog %>% 
  mutate(word_length = nchar(word)))
output
## # A tibble: 3,024 × 9
##    participant sentence_num condition animal word         RT sentence   word_num
##    <chr>              <dbl> <chr>     <chr>  <chr>     <dbl> <chr>         <dbl>
##  1 Par_A                  4 condA     cat    mary         64 Mary saw …        1
##  2 Par_A                  4 condA     cat    saw         127 Mary saw …        2
##  3 Par_A                  4 condA     cat    a            72 Mary saw …        3
##  4 Par_A                  4 condA     cat    beautiful   228 Mary saw …        4
##  5 Par_A                  4 condA     cat    siamese     295 Mary saw …        5
##  6 Par_A                  4 condA     cat    cat          15 Mary saw …        6
##  7 Par_A                  4 condA     cat    on           78 Mary saw …        7
##  8 Par_A                  4 condA     cat    her         121 Mary saw …        8
##  9 Par_A                  4 condA     cat    way          85 Mary saw …        9
## 10 Par_A                  4 condA     cat    home        161 Mary saw …       10
## # ℹ 3,014 more rows
## # ℹ 1 more variable: word_length <int>

Another example: the round() function rounds numbers to a specified number of digits:

round(3.4567, digits = 1)
output
## [1] 3.5
round(3.4567, digits = 2)
output
## [1] 3.46
round(3.4567, digits = 0)
output
## [1] 3

We can use it within mutate() to round all numbers on one or several columns:

ELP %>% 
  mutate(SUBTLWF = round(SUBTLWF, digits = 1),
         Mean_RT = round(Mean_RT, digits = 0))
output
## # A tibble: 880 × 5
##    Word       Length SUBTLWF POS   Mean_RT
##    <chr>       <dbl>   <dbl> <chr>   <dbl>
##  1 rackets         7     1   NN        791
##  2 stepmother     10     4.2 NN        693
##  3 delineated     10     0   VB        960
##  4 swimmers        8     1.5 NN        771
##  5 umpire          6     1.1 NN        882
##  6 cobra           5     3.3 NN        646
##  7 vexes           5     0.1 VB        760
##  8 colonist        8     0.1 NN        682
##  9 bursitis        8     0.4 NN        921
## 10 hatred          6     5.4 NN        696
## # ℹ 870 more rows

Change data type in a column

We can also change data types using mutate().

(spr_catdog <- spr_catdog %>% 
  mutate(participant = as_factor(participant),
         condition = as_factor(condition)))
output
## # A tibble: 3,024 × 9
##    participant sentence_num condition animal word         RT sentence   word_num
##    <fct>              <dbl> <fct>     <chr>  <chr>     <dbl> <chr>         <dbl>
##  1 Par_A                  4 condA     cat    mary         64 Mary saw …        1
##  2 Par_A                  4 condA     cat    saw         127 Mary saw …        2
##  3 Par_A                  4 condA     cat    a            72 Mary saw …        3
##  4 Par_A                  4 condA     cat    beautiful   228 Mary saw …        4
##  5 Par_A                  4 condA     cat    siamese     295 Mary saw …        5
##  6 Par_A                  4 condA     cat    cat          15 Mary saw …        6
##  7 Par_A                  4 condA     cat    on           78 Mary saw …        7
##  8 Par_A                  4 condA     cat    her         121 Mary saw …        8
##  9 Par_A                  4 condA     cat    way          85 Mary saw …        9
## 10 Par_A                  4 condA     cat    home        161 Mary saw …       10
## # ℹ 3,014 more rows
## # ℹ 1 more variable: word_length <int>

As you can see, we can change several variables within one mutate() call. In the same way, we could create several new columns at the same time.

Relabeling these columns as factors has some benefits (and is necessary for some analysis types). One example is a more informative summary() output:

summary(spr_catdog)
output
##   participant    sentence_num   condition       animal         
##  Par_A  : 252   Min.   : 4.00   condA:1428   Length:3024       
##  Par_B  : 252   1st Qu.: 8.00   condB:1596   Class :character  
##  Par_C  : 252   Median :14.00                Mode  :character  
##  Par_D  : 252   Mean   :13.73                                  
##  Par_E  : 252   3rd Qu.:19.00                                  
##  Par_F  : 252   Max.   :23.00                                  
##  (Other):1512                                                  
##      word                 RT           sentence            word_num     
##  Length:3024        Min.   :   1.0   Length:3024        Min.   : 1.000  
##  Class :character   1st Qu.: 222.0   Class :character   1st Qu.: 4.000  
##  Mode  :character   Median : 413.0   Mode  :character   Median : 7.000  
##                     Mean   : 482.5                      Mean   : 6.948  
##                     3rd Qu.: 673.5                      3rd Qu.:10.000  
##                     Max.   :1685.0                      Max.   :16.000  
##                                                                         
##   word_length    
##  Min.   : 1.000  
##  1st Qu.: 3.000  
##  Median : 4.000  
##  Mean   : 4.063  
##  3rd Qu.: 5.000  
##  Max.   :11.000  
## 
summary(spr_catdog$participant)
output
## Par_A Par_B Par_C Par_D Par_E Par_F Par_G Par_H Par_I Par_J Par_K Par_M 
##   252   252   252   252   252   252   252   252   252   252   252   252

Relabel factors

In our spr_catdog experiment, condition A represents a match (i.e. cat/dog is presented with a matching collocate: purring cat, guide dog) and condition B is a mismatch (e.g. guide cat, purring dog). To make this clear in the data, we should label this explicitly. Within a mutate() command, we can use recode() to change the factor labels. The format for this is “old label” = “new label”.

(spr_catdog <- spr_catdog %>% 
  mutate(condition = recode(condition, 
                            "condA" = "match", 
                            "condB" = "mismatch")))
output
## # A tibble: 3,024 × 9
##    participant sentence_num condition animal word         RT sentence   word_num
##    <fct>              <dbl> <fct>     <chr>  <chr>     <dbl> <chr>         <dbl>
##  1 Par_A                  4 match     cat    mary         64 Mary saw …        1
##  2 Par_A                  4 match     cat    saw         127 Mary saw …        2
##  3 Par_A                  4 match     cat    a            72 Mary saw …        3
##  4 Par_A                  4 match     cat    beautiful   228 Mary saw …        4
##  5 Par_A                  4 match     cat    siamese     295 Mary saw …        5
##  6 Par_A                  4 match     cat    cat          15 Mary saw …        6
##  7 Par_A                  4 match     cat    on           78 Mary saw …        7
##  8 Par_A                  4 match     cat    her         121 Mary saw …        8
##  9 Par_A                  4 match     cat    way          85 Mary saw …        9
## 10 Par_A                  4 match     cat    home        161 Mary saw …       10
## # ℹ 3,014 more rows
## # ℹ 1 more variable: word_length <int>

Change columns using logic with case_when()

case_when() allows you to use logical statements to change data in a column.

You can use the same logical operators as you do for filter(), i.e.: - equals to: == - not equal to: != - greater than: > - greater than or equal to: >= - less than: < - less than or equal to: <= - in (i.e. in a vector): %in%

Image by @allison_horst

The syntax within case_when() is:

condition ~ what to do if it is true,

optionally another condition ~ what to do if this is true,

TRUE ~ what to do in all other cases.

case_when() has to be nested inside a mutate() call to make changes to a column:

Let’s try it on a categorical column using %in%

spr_catdog <- spr_catdog %>% 
  mutate(position = case_when( # make a new column named position
    word %in% c("cat", "dog") ~ "critical", # if the word column says cat or dog, write critical 
    TRUE ~ "not critical") # in all other cases, write not critical into the position column
    )

# select is useful to check if this worked correctly
spr_catdog %>% 
  select(word, position)
output
## # A tibble: 3,024 × 2
##    word      position    
##    <chr>     <chr>       
##  1 mary      not critical
##  2 saw       not critical
##  3 a         not critical
##  4 beautiful not critical
##  5 siamese   not critical
##  6 cat       critical    
##  7 on        not critical
##  8 her       not critical
##  9 way       not critical
## 10 home      not critical
## # ℹ 3,014 more rows

As another example, let’s try “dichotomizing” response time, i.e. turning them into a categorical variable (note that this is not recommended!):

spr_catdog <- spr_catdog %>% 
  mutate(RT_categorical = case_when(
    RT < 100 ~ "short", 
    TRUE ~ "long"
  ))

We could also add another condition (or as many as you want!)

spr_catdog <- spr_catdog %>% 
  mutate(RT_categotical = case_when(
    RT < 100 ~ "short",
    RT > 500 ~ "long",
    .default = "normal"
  )) 

Other uses of case_when():

- evaluating participants’ answers as correct or incorrect

- simplify a column (e.g., simplify people’s occupations into student or not student)

- classify transcription attempts as same length as correct answer, longer, shorter

Try it

  1. Challenge: Take the word_num column of spr_catdog. This shows the position the word had in the sentence, where the first word of the sentence is 1, the second word is 2, and so on. Use a case_when statement to make a new column called word_num_cat and dichotomize the variable into three conditions: “beginning” for the first 3 words, “middle” for words # 4 - 7 and “end” for anything else.

  2. Challenge: Using the ELP dataset, first make a new column entitled “freq_nouns” with the condition that it lists “frequent noun” for all entries that are both nouns and have a SUBTL frequency above 50. Then, create a new dataset entitled “ELP_freq_nouns” with ONLY these frequent nouns.

Separate

For the next two examples, we’ll use a new dataset, called animal_corpus. Let’s take a look:

(If you are wondering why we are using read_csv2(), it’s because this dataset is semicolon separated!)

animal_corpus <- read_csv2("data/cat_dog_corpus_data.csv")

head(animal_corpus)
output
## # A tibble: 6 × 3
##    freq collocate    animal
##   <dbl> <chr>        <chr> 
## 1   177 mad_jj       dog   
## 2    45 ginger_jj    cat   
## 3   175 mad_jj       dog   
## 4   251 dangerous_jj dog   
## 5    33 siamese_jj   cat   
## 6   164 mad_jj       dog

This contains corpus data of “cat” and “dog” together with the words that precede “cat” and “dog”. These are called “collocates” and, in our example, also include part of speech tags (the format is word_tag).

separate()

However, the collocate and the POS are in the same column.

The command we need here is: separate(). It takes the following arguments:

- data: our dataframe, we’ll pipe it

- col: which column needs to be separated

- into: a vector that contains the names of the new columns

- sep: which symbol separates the values

- remove (optional): by default, the original column will be deleted. Set to FALSE to keep it.

animal_corpus %>% 
  separate(col = collocate, 
           into = c("coll", "POS"), # more than two columns is also possible
           sep = "_")
output
## # A tibble: 263 × 4
##     freq coll      POS   animal
##    <dbl> <chr>     <chr> <chr> 
##  1   177 mad       jj    dog   
##  2    45 ginger    jj    cat   
##  3   175 mad       jj    dog   
##  4   251 dangerous jj    dog   
##  5    33 siamese   jj    cat   
##  6   164 mad       jj    dog   
##  7    62 because   cs    cat   
##  8   194 barking   jj    dog   
##  9   224 lap       nn1   dog   
## 10   173 mad       jj    dog   
## # ℹ 253 more rows

By default, the input column is deleted, but we can add a line to keep it:

animal_corpus %>% 
  separate(col = collocate, 
           into = c("coll", "POS"), 
           sep = "_",
           remove = FALSE)
output
## # A tibble: 263 × 5
##     freq collocate    coll      POS   animal
##    <dbl> <chr>        <chr>     <chr> <chr> 
##  1   177 mad_jj       mad       jj    dog   
##  2    45 ginger_jj    ginger    jj    cat   
##  3   175 mad_jj       mad       jj    dog   
##  4   251 dangerous_jj dangerous jj    dog   
##  5    33 siamese_jj   siamese   jj    cat   
##  6   164 mad_jj       mad       jj    dog   
##  7    62 because_cs   because   cs    cat   
##  8   194 barking_jj   barking   jj    dog   
##  9   224 lap_nn1      lap       nn1   dog   
## 10   173 mad_jj       mad       jj    dog   
## # ℹ 253 more rows

Let’s overwrite the original “untidy” data frame with the new tidy version:

animal_corpus <- animal_corpus %>% 
  separate(col = collocate, 
           into = c("coll", "POS"), 
           sep = "_")

animal_corpus
output
## # A tibble: 263 × 4
##     freq coll      POS   animal
##    <dbl> <chr>     <chr> <chr> 
##  1   177 mad       jj    dog   
##  2    45 ginger    jj    cat   
##  3   175 mad       jj    dog   
##  4   251 dangerous jj    dog   
##  5    33 siamese   jj    cat   
##  6   164 mad       jj    dog   
##  7    62 because   cs    cat   
##  8   194 barking   jj    dog   
##  9   224 lap       nn1   dog   
## 10   173 mad       jj    dog   
## # ℹ 253 more rows

Reshaping data: pivoting

Let’s look into how to change the shape of your data. There are two options here:

- making your data “longer”, i.e. increase the number of rows and decrease the number of columns: pivot_longer()

- making your data “wider”, i.e. decrease the number of rows and increase the number of columns - not (as) common for tidying but for creating summary tables: pivot_wider()

pivot_longer()

Let’s take another look at some questionnaire data. The df rec_survery contains four participants who each listened to four recordings and tried to transcribe what they heard as accurately as possible.

rec_survey <- read_csv2("data/recordings_survey.csv")

The data from each person is represented on one line. The columns rec1 to rec4 contain the transcription scores (which range from 0 = completely wrong to 1 = completely accurate).

rec_survey %>% 
  head()
output
## # A tibble: 4 × 8
##   participant   age student proficiency  rec1  rec2  rec3  rec4
##   <chr>       <dbl> <chr>         <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 p1             23 yes               5 0.742 0.693 0.83  0.824
## 2 p2             27 yes               6 0.729 0.824 0.921 0.914
## 3 p3             26 no                3 0.635 0.619 0.724 0.634
## 4 p4             31 no                4 0.526 0.602 0.581 0.52

Here, the issue is that the transcription scores are spread across four different columns. For many graphing and modeling purposes, though, we would need a column (which we’ll call “recording”), which identifies the recording and contains either rec1, rec2, rec3, or rec4, and another column that contains the score.

The basic pivot_longer() arguments are:

- cols: which columns should be reshaped?

- names_to: what to name the column that will contain what is now the column names

- values_to: what to name the column that will contain what is now in the cells

You can always try it without using the names_to and values_to arguments first, and then pick the names after you look at the preview!

rec_survey %>% 
  pivot_longer(cols = c("rec1", "rec2", "rec3", "rec4"))
output
## # A tibble: 16 × 6
##    participant   age student proficiency name  value
##    <chr>       <dbl> <chr>         <dbl> <chr> <dbl>
##  1 p1             23 yes               5 rec1  0.742
##  2 p1             23 yes               5 rec2  0.693
##  3 p1             23 yes               5 rec3  0.83 
##  4 p1             23 yes               5 rec4  0.824
##  5 p2             27 yes               6 rec1  0.729
##  6 p2             27 yes               6 rec2  0.824
##  7 p2             27 yes               6 rec3  0.921
##  8 p2             27 yes               6 rec4  0.914
##  9 p3             26 no                3 rec1  0.635
## 10 p3             26 no                3 rec2  0.619
## 11 p3             26 no                3 rec3  0.724
## 12 p3             26 no                3 rec4  0.634
## 13 p4             31 no                4 rec1  0.526
## 14 p4             31 no                4 rec2  0.602
## 15 p4             31 no                4 rec3  0.581
## 16 p4             31 no                4 rec4  0.52
rec_survey %>% 
  pivot_longer(cols = c("rec1", "rec2", "rec3", "rec4"),
               names_to = "recording",
               values_to = "score")
output
## # A tibble: 16 × 6
##    participant   age student proficiency recording score
##    <chr>       <dbl> <chr>         <dbl> <chr>     <dbl>
##  1 p1             23 yes               5 rec1      0.742
##  2 p1             23 yes               5 rec2      0.693
##  3 p1             23 yes               5 rec3      0.83 
##  4 p1             23 yes               5 rec4      0.824
##  5 p2             27 yes               6 rec1      0.729
##  6 p2             27 yes               6 rec2      0.824
##  7 p2             27 yes               6 rec3      0.921
##  8 p2             27 yes               6 rec4      0.914
##  9 p3             26 no                3 rec1      0.635
## 10 p3             26 no                3 rec2      0.619
## 11 p3             26 no                3 rec3      0.724
## 12 p3             26 no                3 rec4      0.634
## 13 p4             31 no                4 rec1      0.526
## 14 p4             31 no                4 rec2      0.602
## 15 p4             31 no                4 rec3      0.581
## 16 p4             31 no                4 rec4      0.52

The cols argument accepts “select syntax”, i.e. anything that works in select() like starts_with(), :, -, etc.

Because pivoting represents a major change compared to the original data frame, it’s often a good idea to save it under a different name:

rec_survey_long <- rec_survey %>% 
  pivot_longer(cols = rec1:rec4,
               names_to = "recording",
               values_to = "score")

rec_survey_long
output
## # A tibble: 16 × 6
##    participant   age student proficiency recording score
##    <chr>       <dbl> <chr>         <dbl> <chr>     <dbl>
##  1 p1             23 yes               5 rec1      0.742
##  2 p1             23 yes               5 rec2      0.693
##  3 p1             23 yes               5 rec3      0.83 
##  4 p1             23 yes               5 rec4      0.824
##  5 p2             27 yes               6 rec1      0.729
##  6 p2             27 yes               6 rec2      0.824
##  7 p2             27 yes               6 rec3      0.921
##  8 p2             27 yes               6 rec4      0.914
##  9 p3             26 no                3 rec1      0.635
## 10 p3             26 no                3 rec2      0.619
## 11 p3             26 no                3 rec3      0.724
## 12 p3             26 no                3 rec4      0.634
## 13 p4             31 no                4 rec1      0.526
## 14 p4             31 no                4 rec2      0.602
## 15 p4             31 no                4 rec3      0.581
## 16 p4             31 no                4 rec4      0.52

Another example: This next file contains the averages of acceptability judgements for the sentences we used in the experiment (we’d expect that sentences which contain mismatching collocates such as “barking cat” are rated as less acceptable than sentences with matching collocates).

acc_judge_original <- read_csv("data/acceptability_judgements.csv")

acc_judge_original %>% 
  head()
output
## # A tibble: 1 × 20
##   Mary saw a beautiful Siamese c…¹ Joshua said that pur…² There was a contente…³
##                              <dbl>                  <dbl>                  <dbl>
## 1                              4.8                    5.3                    5.1
## # ℹ abbreviated names:
## #   ¹​`Mary saw a beautiful Siamese cat on her way home from work.`,
## #   ²​`Joshua said that purring cats are one of the most relaxing things to him.`,
## #   ³​`There was a contented cat looking out from the bookstore window.`
## # ℹ 17 more variables:
## #   `The child told me that ginger cats are by far the cutest of them all.` <dbl>,
## #   `A large tom cat was often seen hanging out near the butcher shop.` <dbl>, …

We can see that this data is in a very wide format - each column name contains one of the sentences, and the only row consists of the average acceptability judgements. Even this dataset can be saved!

What we’d like instead is a column that contains all the sentences and another column that contains all the averages. The command everything() will be very useful here!

(acc_judge <- acc_judge_original %>% 
  pivot_longer(
    cols = everything(), # we want to reshape the entire data, so all variables are concerned
    names_to = "sentence",
    values_to = "rating"
  ))
output
## # A tibble: 20 × 2
##    sentence                                                               rating
##    <chr>                                                                   <dbl>
##  1 Mary saw a beautiful Siamese cat on her way home from work.               4.8
##  2 Joshua said that purring cats are one of the most relaxing things to …    5.3
##  3 There was a contented cat looking out from the bookstore window.          5.1
##  4 The child told me that ginger cats are by far the cutest of them all.     4.9
##  5 A large tom cat was often seen hanging out near the butcher shop.         4.3
##  6 My closest neighbor has a guide dog to help him navigate.                 5.5
##  7 The students were worried that a mad dog had been spotted in town.        5.6
##  8 Although the barking dog was noisy, he was quite cute.                    5.3
##  9 My grandmother had a lap dog that she loved very much.                    5.7
## 10 Tracey was the best dog Jenny's family ever had.                          5.3
## 11 Lisa has heard of Siamese dogs, but has never seen one.                   3.5
## 12 It has been reported that purring dogs are not uncommon.                  1.3
## 13 My best friend Jerry has a contented dog that loves to lounge around.     4.2
## 14 On a walk through the woods, Randy saw a ginger dog that was quite re…    3.2
## 15 The baker said that he saw a tom dog on the loose by the bay.             1.5
## 16 Louise was interested to hear how the guide cat was adapting to his n…    1.2
## 17 The entire city was talking about a mad cat that was seen digging thr…    3.9
## 18 Poppy read in the newspaper that a barking cat had been identified.       1.1
## 19 The doctor had a lap cat to keep her company after her shifts.            4.5
## 20 Georgie was the best cat that had ever taken part in the contest.         4.1

pivot_wider()

While pivot_longer() is often used to tidy data, its opposite pivot_wider() is more common when creating and reformatting summary tables. For example, let’s read in a summary table of “cat” and “dog” with the frequencies of their collocates:

animal_corpus <- read_csv2("data/cat_dog_collocates.csv")

animal_corpus
output
## # A tibble: 15 × 3
##    animal coll              n
##    <chr>  <chr>         <dbl>
##  1 cat    because           5
##  2 cat    contented         4
##  3 cat    ginger            8
##  4 cat    playing           4
##  5 cat    purring           6
##  6 cat    siamese          17
##  7 cat    tom              17
##  8 cat    tortoiseshell     9
##  9 dog    barking           9
## 10 dog    best              4
## 11 dog    dangerous        24
## 12 dog    guide            68
## 13 dog    has              19
## 14 dog    lap              17
## 15 dog    mad              52

Next, we’d like to reshape this data into a more typical contingency table-like format, with “cat” and “dog” as row labels, and the collocates as column labels.

We can use pivot_wider() to achieve this. Its main arguments are:

- names_from: where should the new column names come from?

- values_from: where should the corresponding values come from?

animal_corpus  %>% 
  pivot_wider(
    names_from = coll,
    values_from = n
  )
output
## # A tibble: 2 × 16
##   animal because contented ginger playing purring siamese   tom tortoiseshell
##   <chr>    <dbl>     <dbl>  <dbl>   <dbl>   <dbl>   <dbl> <dbl>         <dbl>
## 1 cat          5         4      8       4       6      17    17             9
## 2 dog         NA        NA     NA      NA      NA      NA    NA            NA
## # ℹ 7 more variables: barking <dbl>, best <dbl>, dangerous <dbl>, guide <dbl>,
## #   has <dbl>, lap <dbl>, mad <dbl>

We can also take those unattested occurrences and fill the NAs with 0s.

animal_corpus  %>% 
  pivot_wider(
    names_from = coll,
    values_from = n,
    values_fill = 0
  )
output
## # A tibble: 2 × 16
##   animal because contented ginger playing purring siamese   tom tortoiseshell
##   <chr>    <dbl>     <dbl>  <dbl>   <dbl>   <dbl>   <dbl> <dbl>         <dbl>
## 1 cat          5         4      8       4       6      17    17             9
## 2 dog          0         0      0       0       0       0     0             0
## # ℹ 7 more variables: barking <dbl>, best <dbl>, dangerous <dbl>, guide <dbl>,
## #   has <dbl>, lap <dbl>, mad <dbl>

Both pivot_longer() and pivot_wider() have a range of additional options in case your data looks like it needs pivoting, but the default settings don’t quite work. Have a look at the vignette: https://tidyr.tidyverse.org/articles/pivot.html

Joining tables

Back to the self-paced reading data:

head(spr_catdog)
output
## # A tibble: 6 × 12
##   participant sentence_num condition animal word         RT sentence    word_num
##   <fct>              <dbl> <fct>     <chr>  <chr>     <dbl> <chr>          <dbl>
## 1 Par_A                  4 match     cat    mary         64 Mary saw a…        1
## 2 Par_A                  4 match     cat    saw         127 Mary saw a…        2
## 3 Par_A                  4 match     cat    a            72 Mary saw a…        3
## 4 Par_A                  4 match     cat    beautiful   228 Mary saw a…        4
## 5 Par_A                  4 match     cat    siamese     295 Mary saw a…        5
## 6 Par_A                  4 match     cat    cat          15 Mary saw a…        6
## # ℹ 4 more variables: word_length <int>, position <chr>, RT_categorical <chr>,
## #   RT_categotical <chr>

We have averages of acceptability judgements for the sentences we used in the experiment in one of the files we read in and pivoted earlier. We’d expect that sentences which contain mismatching collocates such as “barking cat” are rated as less acceptable than sentences with matching collocates.

acc_judge
output
## # A tibble: 20 × 2
##    sentence                                                               rating
##    <chr>                                                                   <dbl>
##  1 Mary saw a beautiful Siamese cat on her way home from work.               4.8
##  2 Joshua said that purring cats are one of the most relaxing things to …    5.3
##  3 There was a contented cat looking out from the bookstore window.          5.1
##  4 The child told me that ginger cats are by far the cutest of them all.     4.9
##  5 A large tom cat was often seen hanging out near the butcher shop.         4.3
##  6 My closest neighbor has a guide dog to help him navigate.                 5.5
##  7 The students were worried that a mad dog had been spotted in town.        5.6
##  8 Although the barking dog was noisy, he was quite cute.                    5.3
##  9 My grandmother had a lap dog that she loved very much.                    5.7
## 10 Tracey was the best dog Jenny's family ever had.                          5.3
## 11 Lisa has heard of Siamese dogs, but has never seen one.                   3.5
## 12 It has been reported that purring dogs are not uncommon.                  1.3
## 13 My best friend Jerry has a contented dog that loves to lounge around.     4.2
## 14 On a walk through the woods, Randy saw a ginger dog that was quite re…    3.2
## 15 The baker said that he saw a tom dog on the loose by the bay.             1.5
## 16 Louise was interested to hear how the guide cat was adapting to his n…    1.2
## 17 The entire city was talking about a mad cat that was seen digging thr…    3.9
## 18 Poppy read in the newspaper that a barking cat had been identified.       1.1
## 19 The doctor had a lap cat to keep her company after her shifts.            4.5
## 20 Georgie was the best cat that had ever taken part in the contest.         4.1

Ideally, we want this information to be in the same data frame as the self-paced reading data. Let’s add the acceptability judgement data to the SPR data: we want to “join” the two datasets into one, so each sentence should be matched up with its average rating.

There are several join commands, which differ in how they match up datasets and which cases are kept or discarded.

Their syntax is: xxxx_join(dataframe1, dataframe2, by = “column that is present in both dfs”)

The column “sentence” is what we need to match up acceptability judgements with the SPR data.

I’ll use left_join, because if the sentence doesn’t show up in the spr_catdog dataset, we don’t need the rating! A full_join would also be a safe option: this would leave an empty row with just the sentence and rating if there is one missing from the spr_catdog data

left_join(spr_catdog, acc_judge, 
                     by = "sentence") # which column should be used to match up the two data frames?
output
## # A tibble: 3,024 × 13
##    participant sentence_num condition animal word         RT sentence   word_num
##    <fct>              <dbl> <fct>     <chr>  <chr>     <dbl> <chr>         <dbl>
##  1 Par_A                  4 match     cat    mary         64 Mary saw …        1
##  2 Par_A                  4 match     cat    saw         127 Mary saw …        2
##  3 Par_A                  4 match     cat    a            72 Mary saw …        3
##  4 Par_A                  4 match     cat    beautiful   228 Mary saw …        4
##  5 Par_A                  4 match     cat    siamese     295 Mary saw …        5
##  6 Par_A                  4 match     cat    cat          15 Mary saw …        6
##  7 Par_A                  4 match     cat    on           78 Mary saw …        7
##  8 Par_A                  4 match     cat    her         121 Mary saw …        8
##  9 Par_A                  4 match     cat    way          85 Mary saw …        9
## 10 Par_A                  4 match     cat    home        161 Mary saw …       10
## # ℹ 3,014 more rows
## # ℹ 5 more variables: word_length <int>, position <chr>, RT_categorical <chr>,
## #   RT_categotical <chr>, rating <dbl>

If you don’t specify the variable, R will check if there’s a variable with the same name in both data frames:

left_join(spr_catdog, acc_judge)
output
## # A tibble: 3,024 × 13
##    participant sentence_num condition animal word         RT sentence   word_num
##    <fct>              <dbl> <fct>     <chr>  <chr>     <dbl> <chr>         <dbl>
##  1 Par_A                  4 match     cat    mary         64 Mary saw …        1
##  2 Par_A                  4 match     cat    saw         127 Mary saw …        2
##  3 Par_A                  4 match     cat    a            72 Mary saw …        3
##  4 Par_A                  4 match     cat    beautiful   228 Mary saw …        4
##  5 Par_A                  4 match     cat    siamese     295 Mary saw …        5
##  6 Par_A                  4 match     cat    cat          15 Mary saw …        6
##  7 Par_A                  4 match     cat    on           78 Mary saw …        7
##  8 Par_A                  4 match     cat    her         121 Mary saw …        8
##  9 Par_A                  4 match     cat    way          85 Mary saw …        9
## 10 Par_A                  4 match     cat    home        161 Mary saw …       10
## # ℹ 3,014 more rows
## # ℹ 5 more variables: word_length <int>, position <chr>, RT_categorical <chr>,
## #   RT_categotical <chr>, rating <dbl>

When joining, it’s often a good idea to create a new dataframe instead of overwriting any of the old ones.

spr_acc <- left_join(spr_catdog, acc_judge, 
                     by = "sentence")

A quick glance at the Environment tab confirms that we now have a dataframe with the same number of rows (“observations”) but one more column, just as we want. But when joining data, plenty of things can go wrong, so let’s look at the data:

spr_acc %>% 
  select(sentence, rating) %>% 
  distinct()
output
## # A tibble: 20 × 2
##    sentence                                                               rating
##    <chr>                                                                   <dbl>
##  1 Mary saw a beautiful Siamese cat on her way home from work.               4.8
##  2 Joshua said that purring cats are one of the most relaxing things to …    5.3
##  3 There was a contented cat looking out from the bookstore window.          5.1
##  4 The child told me that ginger cats are by far the cutest of them all.     4.9
##  5 A large tom cat was often seen hanging out near the butcher shop.         4.3
##  6 My closest neighbor has a guide dog to help him navigate.                 5.5
##  7 The students were worried that a mad dog had been spotted in town.        5.6
##  8 Although the barking dog was noisy, he was quite cute.                    5.3
##  9 My grandmother had a lap dog that she loved very much.                    5.7
## 10 Tracey was the best dog Jenny's family ever had.                          5.3
## 11 Lisa has heard of Siamese dogs, but has never seen one.                   3.5
## 12 It has been reported that purring dogs are not uncommon.                  1.3
## 13 My best friend Jerry has a contented dog that loves to lounge around.     4.2
## 14 On a walk through the woods, Randy saw a ginger dog that was quite re…    3.2
## 15 The baker said that he saw a tom dog on the loose by the bay.             1.5
## 16 Louise was interested to hear how the guide cat was adapting to his n…    1.2
## 17 The entire city was talking about a mad cat that was seen digging thr…    3.9
## 18 Poppy read in the newspaper that a barking cat had been identified.       1.1
## 19 The doctor had a lap cat to keep her company after her shifts.            4.5
## 20 Georgie was the best cat that had ever taken part in the contest.         4.1

Looks good!

As a final example, we’ll read in one last, very simple, dataset. It contains the raw frequencies of each of the words participants read in the SPR experiment (word frequency tends to be a good predictor of reading times: more common words are usually read faster). Let’s have a look:

corpus_frequencies <- read_tsv("data/corpus_frequencies.txt")

head(corpus_frequencies)
output
## # A tibble: 6 × 2
##   token  frequency
##   <chr>      <dbl>
## 1 of       5973437
## 2 one      3009801
## 3 things   2565070
## 4 a        2136923
## 5 all      2136923
## 6 in       1914216

We’d like to add the word frequency to our data, so the column to match by is “word” in the spr data but it’s called “token” in the corpus data. To join this, we can either rename one of these or use a vector in by() which lists the two column names as they appear in the dataframes:

Syntax: add the argument by = c(“colname_df1” = “colname_df2”)

spr_acc_freq <- full_join(spr_acc, corpus_frequencies, by = c("word" = "token"))

spr_acc_freq
output
## # A tibble: 3,024 × 14
##    participant sentence_num condition animal word         RT sentence   word_num
##    <fct>              <dbl> <fct>     <chr>  <chr>     <dbl> <chr>         <dbl>
##  1 Par_A                  4 match     cat    mary         64 Mary saw …        1
##  2 Par_A                  4 match     cat    saw         127 Mary saw …        2
##  3 Par_A                  4 match     cat    a            72 Mary saw …        3
##  4 Par_A                  4 match     cat    beautiful   228 Mary saw …        4
##  5 Par_A                  4 match     cat    siamese     295 Mary saw …        5
##  6 Par_A                  4 match     cat    cat          15 Mary saw …        6
##  7 Par_A                  4 match     cat    on           78 Mary saw …        7
##  8 Par_A                  4 match     cat    her         121 Mary saw …        8
##  9 Par_A                  4 match     cat    way          85 Mary saw …        9
## 10 Par_A                  4 match     cat    home        161 Mary saw …       10
## # ℹ 3,014 more rows
## # ℹ 6 more variables: word_length <int>, position <chr>, RT_categorical <chr>,
## #   RT_categotical <chr>, rating <dbl>, frequency <dbl>

If the join worked as intended, we can remove the input data frames to declutter our Environment:

rm(spr_acc, corpus_frequencies)

Other options:

- bind_rows() to combine data frames rowwise

- bind_cols() to combine data frames columnwise

Try it

  1. Take a look at the following dataset which contains more information on the participants in the SPR experiment and join it the spr_acc_freq data. Save the new dataset as spr_full
animal_survey <- read_csv("data/animal_survey.csv")
animal_survey
output
## # A tibble: 12 × 11
##    participant   age nationality animals_cute animals_look animals_pet
##    <chr>       <dbl> <chr>              <dbl>        <dbl>       <dbl>
##  1 Par_A          24 US                     2            4           1
##  2 Par_B          28 US                     4            3           3
##  3 Par_C          18 UK                     4            4           5
##  4 Par_D          32 US                     3            3           4
##  5 Par_E          36 UK                     1            1           1
##  6 Par_F          25 UK                     5            5           5
##  7 Par_G          21 UK                     4            4           3
##  8 Par_H          68 US                     3            3           1
##  9 Par_I          34 DE                     5            5           5
## 10 Par_J          22 UK                     5            4           4
## 11 Par_K          28 US                     3            3           3
## 12 Par_M          33 US                     4            4           5
## # ℹ 5 more variables: animals_have <dbl>, gender <chr>, education <dbl>,
## #   cats <dbl>, dogs <dbl>
spr_acc_freq 
output
## # A tibble: 3,024 × 14
##    participant sentence_num condition animal word         RT sentence   word_num
##    <fct>              <dbl> <fct>     <chr>  <chr>     <dbl> <chr>         <dbl>
##  1 Par_A                  4 match     cat    mary         64 Mary saw …        1
##  2 Par_A                  4 match     cat    saw         127 Mary saw …        2
##  3 Par_A                  4 match     cat    a            72 Mary saw …        3
##  4 Par_A                  4 match     cat    beautiful   228 Mary saw …        4
##  5 Par_A                  4 match     cat    siamese     295 Mary saw …        5
##  6 Par_A                  4 match     cat    cat          15 Mary saw …        6
##  7 Par_A                  4 match     cat    on           78 Mary saw …        7
##  8 Par_A                  4 match     cat    her         121 Mary saw …        8
##  9 Par_A                  4 match     cat    way          85 Mary saw …        9
## 10 Par_A                  4 match     cat    home        161 Mary saw …       10
## # ℹ 3,014 more rows
## # ℹ 6 more variables: word_length <int>, position <chr>, RT_categorical <chr>,
## #   RT_categotical <chr>, rating <dbl>, frequency <dbl>

Exporting data

We’ve added a few new columns to our data, which contain the transformed variables. We might want to export this and save it to, e.g., a csv file to read back in later.

To write files to disk, you also need to select which separator you want to use (commas, semicolons, or tabs).

CSVs are often saved as .csv (this is also openable in Excel) and TSVs are often saved as .txt

You can save CSVs and TSVs like so:

  • write_csv(df, "name_df.csv") comma-separated

  • write_csv2(df, "name_df.csv") semicolon-separated

  • write_tsv(df, "name_df.txt") tab-separated

Make sure to not overwrite the original file! Pick a different name:

write_csv(spr_acc_freq, "data/spr_catdog_full.csv")

Remember, something like data type in R (ex: character or factor) can’t be saved into one of these files – it just saves the values separated by the separator (and the column names). So you’ll still have to do the conversions next time you read the file into R.

Recap

  • the pipe %>% takes the item before it (usually the dataframe) and feeds it to the following command as the first argument -> allows you to chain several commands in a readable way
  • rename() and arrange() columns
  • subsets: select() to pick column(s), filter() to pick rows
    • conditions with >, >=, <, <=, ==, !=, %in%
    • logical operators: and & as well as or | to use several conditions
  • distinct() to keep unique rows
  • mutate() to create new columns or change existing columns
    • change data type
    • change factor labels
    • case_when() statements: create columns based on data in other columns
  • Two common functions for tidying data are:
    • separate() if you have two variables in one column
    • pivot_longer() if your data is in wide format but needs to be long
  • And pivot_wider()
    • usually not helpful for tidying data but can be useful for reformatting summary and/or contingency tables
  • Joining data from two data frames into one to add additional information
  • Exporting data from R into different file formats

Note that all these functions have additional arguments and options, so if, in order to clean data, you think you need a function but it doesn’t quite work, take a look at the documentation!